Theory and Practice
install.packages(c("kernlab", "magick", "tesseract", "pdftools"))
Give the following a try, but move on if it does not work!
install.packages(c("rJava", "tabulizer"))
If it does not work, take a look at this GitHub page when you are on your own machine.
So far, we have been in the world of text made with the sole intention of displaying it on a screen. That is, however, just a very small slice of the text that is out in the world. Take a PDF file, for instance. We might imagine that it was generated to appear on a screen, but it is a vastly different animal than our traditional web-based text.
Just like our old friends decision trees and random forests, the support vector machine (SVM) is a technique that can prove useful for both regression and classification. In the context of text, we will use SVM’s classification abilities.
SVM is popular because it is both powerful and conceptually easy to understand. Since we are dealing with a classifier at heart, all we are doing is defining a hyperplane that will separate data in (hyper)dimensional space. Easy, right?
Let’s look at the following plot with 2 classes:
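The pointPlot object used in the chunks below is never defined in these notes (it belonged to the original figure). A minimal sketch that builds a comparable two-class scatter from made-up data might look like this (the group means are assumptions, chosen just so a line with slope 1 can plausibly separate the classes):

```r
library(ggplot2)

set.seed(1001)

# made-up data: two classes separated along the y = x + c direction
toyData = data.frame(
  x = c(rnorm(25, mean = 2, sd = .75), rnorm(25, mean = 5, sd = .75)),
  y = c(rnorm(25, mean = 6, sd = .75), rnorm(25, mean = 3, sd = .75)),
  class = rep(c("A", "B"), each = 25)
)

pointPlot = ggplot(toyData, aes(x = x, y = y, color = class)) +
  geom_point() +
  theme_minimal()

pointPlot
```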
Now we can get to those hyperplanes. We need to define that hyperplane so that it does a few things:
It needs to separate the points properly based upon their classifications.
It needs to maximize the distance between itself and the groups (this distance is called the margin).
Let’s look at some candidate lines:
pointPlot +
geom_abline(intercept = 0, color = "red") +
geom_abline(intercept = 1, color = "blue") +
geom_abline(intercept = 2, color = "green") +
geom_abline(intercept = 3, color = "black")
So to fulfill our first condition, we can immediately rule out the red line. We now have three lines that might work:
pointPlot +
geom_abline(intercept = 1, color = "blue") +
geom_abline(intercept = 2, color = "green") +
geom_abline(intercept = 3, color = "black")
With the remaining lines, which one maximizes the margin between the classes?
pointPlot +
geom_abline(intercept = 2, color = "green")
That looks like an appropriate hyperplane, but what are the support vectors? They are the individual observations (vectors) lying closest to the hyperplane – the points that actually define where it sits and how wide the margin can be!
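If you want to see a fitted model's support vectors, kernlab (which we are about to load anyway) exposes them directly. A small sketch on a two-class subset of iris (purely illustrative data, not our attrition problem):

```r
library(kernlab)

# two-class subset of iris, just for illustration
twoClass = droplevels(iris[iris$Species != "setosa", ])

# a linear ("vanilladot") SVM on two predictors
svFit = ksvm(Species ~ Petal.Length + Petal.Width, data = twoClass,
             kernel = "vanilladot", C = 1)

# row indices of the observations acting as support vectors
SVindex(svFit)

# how many support vectors the fit relies on
nSV(svFit)
```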
R has many SVM packages, with e1071 being a classic. For our work in caret, we will play with kernlab.
library(caret)
library(dplyr)
library(kernlab)
library(rsample)
attrition = attrition %>%
mutate_at(c("JobLevel", "StockOptionLevel", "TrainingTimesLastYear"), factor)
set.seed(1001)
split = initial_split(attrition, prop = .6, strata = "Attrition")
attritionTrain = training(split)
attritionTest = testing(split)
svmTrainControl = trainControl(method = "cv", number = 10, verboseIter = FALSE)
svmAttrition = train(Attrition ~ ., data = attritionTrain,
method = "svmLinear", trControl = svmTrainControl, metric = "Accuracy")
svmAttrition
Support Vector Machines with Linear Kernel
883 samples
30 predictor
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 794, 795, 794, 794, 795, 795, ...
Resampling results:
Accuracy Kappa
0.8720633 0.4687678
Tuning parameter 'C' was held constant at a value of 1
We see that we have a tuning parameter called C (Cost). Remember the planes that we drew earlier, where the goal was to correctly classify and maximize the margins? We cannot always do both in reality, and the C parameter indicates which we prefer. Higher C values will push us towards correct classification over maximizing the margins. It essentially serves as a regularization parameter. The values that the C parameter can take range from very small (think .000001) to very large (1000).
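One way to see C at work: with a small C, more margin violations are tolerated, so more observations typically end up inside the margin and become support vectors. A quick, illustrative sketch (again on an iris subset, not our attrition data):

```r
library(kernlab)

twoClass = droplevels(iris[iris$Species != "setosa", ])

# count the support vectors at several costs; looser margins (small C)
# generally pull more observations into the support set
sapply(c(.01, 1, 100), function(cost) {
  nSV(ksvm(Species ~ Petal.Length + Petal.Width, data = twoClass,
           kernel = "vanilladot", C = cost))
})
```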
svmAttrition = train(Attrition ~ ., data = attritionTrain,
method = "svmLinear", trControl = svmTrainControl,
tuneGrid = data.frame(C = c(.001, .1, .5, 1, 5, 10, 100)),
metric = "Accuracy", preProc = c("center", "scale"))
A C parameter of .1 tends to get us the best accuracy, but they are all pretty close to each other.
svmAttritionTuned = train(Attrition ~ ., data = attritionTrain,
method = "svmLinear", trControl = svmTrainControl,
tuneGrid = data.frame(C = .1),
metric = "Accuracy", preProc = c("center", "scale"))
confusionMatrix(svmAttritionTuned)
Cross-Validated (10 fold) Confusion Matrix
(entries are percentual average cell counts across resamples)
Reference
Prediction No Yes
No 80.6 8.9
Yes 3.2 7.2
Accuracy (average) : 0.8788
Now, we can test our model:
svmAttritionTest = predict(svmAttritionTuned, attritionTest)
confusionMatrix(svmAttritionTest, attritionTest$Attrition)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 482 47
Yes 11 47
Accuracy : 0.9012
95% CI : (0.8741, 0.9241)
No Information Rate : 0.8399
P-Value [Acc > NIR] : 1.172e-05
Kappa : 0.5653
Mcnemar's Test P-Value : 4.312e-06
Sensitivity : 0.9777
Specificity : 0.5000
Pos Pred Value : 0.9112
Neg Pred Value : 0.8103
Prevalence : 0.8399
Detection Rate : 0.8211
Detection Prevalence : 0.9012
Balanced Accuracy : 0.7388
'Positive' Class : No
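Those headline statistics can be reproduced by hand from the four cell counts in the table above (remember that caret is treating No as the 'positive' class here):

```r
# cell counts from the confusion matrix above
noNo = 482; noYes = 47    # predicted No:  truly No, truly Yes
yesNo = 11; yesYes = 47   # predicted Yes: truly No, truly Yes

(noNo + yesYes) / (noNo + noYes + yesNo + yesYes)  # accuracy:       0.9012
noNo / (noNo + yesNo)                              # sensitivity:    0.9777
yesYes / (noYes + yesYes)                          # specificity:    0.5000
noNo / (noNo + noYes)                              # pos pred value: 0.9112
```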
I think it is fair to say that this model performed better than our simple neural network and our Naive Bayes. While we could take it and apply it back to the lyrics stuff, we are going to do something a little bit different on the text front.
Instead of looking at text as a collection of words that might have some meaning, we are going to look at the shape of text – in other words, the letters themselves. Why might we do this? For starters, think again about the aforementioned PDF or any other scanned or written text.
library(rvest)
letterData = readr::read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/letter-recognition/letter-recognition.data",
col_names = FALSE)
letterName = read_html("https://archive.ics.uci.edu/ml/machine-learning-databases/letter-recognition/letter-recognition.names") %>%
html_text() %>%
stringr::str_replace_all(., "\t", " ") %>%
readr::read_lines() %>%
stringr::str_extract(., "(?<=^\\s[0-9]{1,2}\\.\\s)\\w+-*\\w*|(?<=^\\s\\s[0-9]{1,2}\\.\\s)\\w+-*\\w*") %>%
na.omit()
names(letterData) = letterName
rmarkdown::paged_table(letterData)
This is classic letter data, in which the shape of letters has been broken down to 16 features.
We can go through our now routine data-prep steps:
set.seed(1001)
splitLetter = initial_split(letterData, prop = .8, strata = "lettr")
letterTrain = training(splitLetter)
letterTest = testing(splitLetter)
svmTrainControl = trainControl(method = "cv", number = 10, verboseIter = FALSE)
svmLetterLinear = train(lettr ~ ., data = letterTrain, method = "svmLinear",
trControl = svmTrainControl, metric = "Accuracy",
preProcess = c("center", "scale"),
tuneGrid = data.frame(C = c(.0001, .001, 1, 10, 100)))
svmLetterLinear
Support Vector Machines with Linear Kernel
16001 samples
16 predictor
26 classes: 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'
Pre-processing: centered (16), scaled (16)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 14403, 14401, 14399, 14404, 14399, 14399, ...
Resampling results across tuning parameters:
C Accuracy Kappa
1e-04 0.1000570 0.06205329
1e-03 0.6384001 0.62384509
1e+00 0.8523854 0.84647320
1e+01 0.8509482 0.84497894
1e+02 0.8507632 0.84478637
Accuracy was used to select the optimal model using the
largest value.
The final value used for the model was C = 1.
svmLetterTest = predict(svmLetterLinear, letterTest)
confusionMatrix(svmLetterLinear, norm = "none")
Cross-Validated (10 fold) Confusion Matrix
(entries are un-normalized aggregated counts)
Reference
Prediction A B C D E F G H I J K L M N
A 590 3 0 1 0 0 2 0 2 12 0 0 4 2
B 0 546 0 25 5 5 1 12 3 1 2 1 8 2
C 2 0 514 0 10 0 28 10 1 0 9 2 0 1
D 2 8 0 567 0 5 18 31 8 6 6 6 0 10
E 0 4 17 0 532 5 4 1 1 1 7 23 0 0
F 0 1 1 1 4 537 5 12 11 4 0 0 0 1
G 2 8 14 2 25 6 490 9 1 0 11 13 1 0
H 0 11 4 5 0 1 5 388 0 3 5 5 11 12
I 0 2 0 0 1 3 0 0 528 33 0 0 0 0
J 3 1 0 2 0 3 1 3 19 515 0 0 0 0
K 4 1 23 2 3 1 10 21 0 0 483 3 1 1
L 1 0 1 0 3 0 7 1 3 0 4 521 0 0
M 4 0 2 2 0 0 3 0 0 0 1 0 591 9
N 1 1 0 7 0 5 0 2 0 3 0 0 3 554
O 0 1 7 7 0 0 2 32 0 2 0 0 1 4
P 0 2 0 3 0 12 2 4 1 0 0 0 0 1
Q 1 0 0 1 9 0 18 7 0 0 1 9 0 0
R 3 25 0 12 3 0 7 41 0 1 44 1 6 3
S 3 6 1 1 17 12 16 0 13 7 1 4 0 0
T 0 0 1 0 8 17 0 0 0 0 1 0 0 0
U 0 0 4 6 0 1 1 7 0 1 2 0 7 3
V 1 1 0 0 0 0 7 2 0 0 1 1 0 3
W 0 0 0 0 0 0 4 0 0 0 1 0 6 1
X 0 4 0 0 2 0 1 6 5 1 16 6 0 0
Y 5 0 0 0 0 2 0 1 1 0 0 0 0 0
Z 0 0 0 1 6 0 0 0 7 10 0 0 0 0
Reference
Prediction O P Q R S T U V W X Y Z
A 9 0 11 6 2 1 5 4 1 0 1 3
B 0 2 7 21 28 0 1 8 0 2 0 1
C 7 0 1 0 1 1 1 0 0 0 0 0
D 18 3 3 12 1 2 2 0 0 7 2 1
E 0 0 13 2 28 5 0 0 0 11 0 15
F 0 48 0 1 14 7 0 0 0 3 13 4
G 4 14 22 7 14 11 1 3 1 3 0 0
H 50 2 4 17 0 8 5 5 4 1 2 0
I 0 0 0 0 11 0 0 0 0 7 1 2
J 2 1 3 0 3 0 0 0 0 5 0 24
K 0 3 1 26 0 5 5 1 0 16 0 0
L 0 0 3 0 9 0 0 0 0 7 0 0
M 3 0 0 4 0 0 9 1 20 0 1 0
N 1 0 0 5 0 0 3 0 4 0 1 0
O 485 3 18 4 0 0 4 0 2 2 0 0
P 4 546 0 0 0 3 0 3 0 0 4 0
Q 9 1 497 1 12 1 0 0 0 0 5 5
R 6 1 3 489 3 2 0 3 2 3 0 0
S 0 1 31 0 424 9 0 0 0 2 3 55
T 1 0 0 0 6 577 2 1 0 3 13 4
U 6 1 0 0 0 1 604 1 2 2 3 0
V 1 3 1 2 0 0 1 552 4 0 18 0
W 20 1 1 1 0 0 4 14 550 0 0 0
X 1 0 0 2 3 3 0 0 0 532 4 0
Y 0 12 2 0 1 6 0 10 0 3 539 0
Z 0 0 2 0 44 12 0 0 0 1 0 488
Accuracy (average) : 0.8524
prop.table(table(svmLetterTest == letterTest$lettr))
FALSE TRUE
0.1452863 0.8547137
This exercise leads us to a really interesting point about our SVM models – the mapping of the support vectors. So far, we have been drawing our separating lines in flat, linear (multidimensional) space.
library(plotly)
plot_ly(letterData, x = ~`x-box`, y = ~`y-box`, z = ~onpix, color = ~lettr,
type= "scatter3d", mode = "markers")
That is going to be a nope for me. We are dealing with only 3 dimensions (out of 16) and I am not entirely convinced that we could find straight lines or flat planes to do that splitting for us (our eyes cannot, but a machine might be able to).
Instead, we can do some kernel transformations for the SVM. A kernel transformation can take our linearly non-separable space, project it into an even higher-dimensional space, and then achieve linear separation there. This improved separation might lead to better predictions. The particular type of kernel we will use for our SVM is a radial basis function.
We will also add in a new tuning parameter: sigma. Sigma controls the smoothness of the decision boundary. Since we are dealing in nonlinear space, the decision boundary can be very choppy (high values of sigma – likely to overfit, essentially a local classifier) or very smooth (low values of sigma – more likely to produce training errors, a global classifier).
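For a sense of what sigma does mechanically: kernlab's radial basis function is k(x, x') = exp(-sigma * ||x - x'||^2), so sigma is the inverse width of the kernel. We can compute it by hand and confirm against kernlab's own kernel constructor:

```r
library(kernlab)

x = c(1, 2)
y = c(2, 4)
sigma = .5

# by hand: exp(-sigma * squared euclidean distance)
exp(-sigma * sum((x - y)^2))

# the same value through kernlab's rbfdot kernel
rbf = rbfdot(sigma = .5)
rbf(x, y)
```

Both lines return exp(-2.5). As sigma grows, the kernel value decays faster with distance, which is why high-sigma boundaries hug individual training points.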
tuningGrid = expand.grid(C = c(1, 5, 10),
sigma = c(.5, 1, 5, 10))
svmLetterRBF = train(lettr ~ ., data = letterTrain, method = "svmRadial",
trControl = svmTrainControl, metric = "Accuracy",
preProcess = c("center", "scale"),
tuneGrid = tuningGrid)
# Load a previously-fit copy of the model to save computation time
load("C://Users/sberry5/Documents/teaching/courses/unstructured/data/svmLetterRBF.RData")
svmLetterRBF
Support Vector Machines with Radial Basis Function Kernel
16001 samples
16 predictor
26 classes: 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'
Pre-processing: centered (16), scaled (16)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 14401, 14397, 14400, 14402, 14401, 14400, ...
Resampling results across tuning parameters:
C sigma Accuracy Kappa
1 0.5 0.9638810 0.9624345
1 1.0 0.9426304 0.9403323
1 5.0 0.4652763 0.4432935
1 10.0 0.2475390 0.2160150
5 0.5 0.9666930 0.9653592
5 1.0 0.9440058 0.9417629
5 5.0 0.5002744 0.4797681
5 10.0 0.2761631 0.2458865
10 0.5 0.9669428 0.9656190
10 1.0 0.9440056 0.9417627
10 5.0 0.5002744 0.4797681
10 10.0 0.2761631 0.2458865
Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were sigma = 0.5 and C = 10.
It appears that a higher cost, coupled with a lower sigma, leads to the best predictions.
svmLetterRBFTest = predict(svmLetterRBF, letterTest)
confusionMatrix(svmLetterRBFTest, letterTest$lettr)
Knowing how one might perform OCR with SVM is important, but you probably won’t ever need to go down that road. Why? There are already great tools that will do the work for you. Probably the easiest one to work with is Tesseract. Tesseract has been supported by Google for several years now and continues to make great leaps. A combination of ImageMagick, Tesseract, and pdftools will handle just about any image-based data extraction needs that you might have.
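To round out that toolkit: pdftools can rasterize a PDF's pages so tesseract can read them. A hedged sketch (the file name is hypothetical):

```r
library(pdftools)
library(tesseract)

# render each page of a (hypothetical) scanned pdf to a png;
# a higher dpi generally helps tesseract
pagePngs = pdf_convert("scannedDocument.pdf", format = "png", dpi = 300)

# ocr() accepts a vector of image paths, one result per page
scannedText = ocr(pagePngs)

cat(scannedText, sep = "\n\n")
```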
library(magick)
# library(pdftools)
library(tesseract)
loveCraftLetter = ocr("C://Users/sberry5/Documents/teaching/courses/unstructured/data/lovecraftLetter.jpg")
cat(loveCraftLetter)
l-s- a-m ,or’aém‘d «~6er w/l m4,” m7: m»; I .u 464;.
F m) MU»; _l‘ M 7 Hum , mi; 7’44-‘94 # wZZA-M £494.“
+.. 3.4146. WW Hmh/r.-. 72“.“
9% 7 ‘ 7 r: S‘H'JVM‘L, 504‘ 598 Angegbst” Pgovigeiice, R.I.,
eel 1.7.7::st (Jl'Abwaaw‘uy<U+FB02>‘ 4' eruary '12
My dear Mr; Hennebergerz-
E I was very glad to hear from you, end to receive
so many sidelights on \‘EIRD TALES, whose chosen field makes me very eager for
its succees. I lmow the financial end of magazine publishing must 'be a.
tremendous and often discouraging responsibility, and I have a sincere
respect for the pluck and determination of anybody who unde rtakea such a.
venture. most certainly do‘I hope that some favourable turn will gradually
transform your buldensome debt on the two magazines into an increasingly
gratifying profitéuand it seems to me that many facts warrant such optimism,
for in the weird field you are practically alone and with a. good start,
whilst in the detective field there sees to be an insatiable demand for new
material. Still, I know that marketing is e. venturesome and uncertain
process---espec1ally with dealers in the unscrupulous state of mind you
describe!
I assure you that I was not at all disconcerted by the presence of "The
Transparent Ghost" beside my "Hound". In-the first place, I don't take
myself too seriously; and in the second place Izean appreciate the sort of
humour involved in such touches of "comic relief"---like the gravedigger 1n
"Hamlet" or the porter in "Macbeth“. When .5. magazine covers a popular
clientele and appeals to one particular interest, it is peculiarly apt to
elicit literaryu-or more or lees»1iteraryv--contributions from its readers;
so that I suppose a very large proportion of those who have seen W'EIRJR
TALES have <U+FB02>ooded the office with unacceptable manuscripts, To them the
whole subject of impossible contributions has become a live issue, so that
the exploitation of some comically illiterate attempt carries a piquency
which they can feel and smile at even though others may find it somewhat
tedious and inapropos. "The Transparent Ghost" may not be an austerely
literary asset, yet I cannot doubt but that it will make many friends for the
magazine, and perhaps assuage more than one subtle sting left behind by
rej ected MES.
I hope, anyway, that this matter won't be instrumental in deposing Mr.
Baird from the editorship until he is himself ready to relinquish it; for I
feel that he must have done very well on the whole, considering the adverse
canditions encountered in the quest for really weird stories. That he could
get hold of as many as five perfectly satisfactory yarns is an almost
remarkable phenomenon in view of the lack of truly artistic and individual
expression among professional fiction-writere. When I see a magazine tending
toward the commonplace, the last people I blame are the editors and publishers)-
for eveh a cursory survey of the protessionel writing field shows that the
trouble is something infinitely deeper and wider---—someth1ng concerning no
one publication, but the whole atmosphere and temperament of theAAmerican
fiction business. And even when I get to such large units as this, I can't
be any too savage about the b1aming---beeause I realise that much of the
trouble is absolutely inevitable———as incapable of human rmedy as the fate
of any protagonist 1n the creek drama. Here 111 America, we have a very
conventional and hnlf-educeted public---a public trained under one phase or
another of the Puritan tradition, and almost dulled to aesthetic aensitiveneas
because of the monotonous and omnipresent overstreesing of the ethical
element. We have millions who lack the intellectual independence courage,
and <U+FB02>exibility to get an artistic thrill out or a bizarre situation, and.
who enter sympathetically into a story only when it ignores the colour and
vividnese of actual human emotions and conventionally presents a simple
plot based on artificial, ethically sugar-coated values and leading to a
<U+FB02>at denouement which shall vindicate every current platitx tie and. leave no
mystery unexplained by the shallow comprehension of the most mediocre reader.
letterConfidence = ocr_data("C://Users/sberry5/Documents/teaching/courses/unstructured/data/lovecraftLetter.jpg")
letterConfidence
# A tibble: 676 x 3
word confidence bbox
<chr> <dbl> <chr>
1 l-s- 69.7 17,6,67,30
2 a-m 61.8 85,9,169,27
3 ,or’aém‘d 60.8 184,6,300,37
4 «~6er 61.7 314,6,453,29
5 w/l 60.9 471,9,523,31
6 "m4,â€\u009d" 48.3 541,12,632,34
7 m7: 69.1 693,16,765,36
8 m»; 51.8 792,18,901,35
9 I 45.8 925,27,940,36
10 .u 64.2 994,18,1042,36
# ... with 666 more rows
That is pretty good on its own. This document also gives us some insight into the difference between OCR and ICR (intelligent character recognition). Machines do a great job with machine-generated text, but have a tougher time (out of the gate) with handwritten text.
If this image is data, and all data needs cleaning, then let’s clean up our data for hopefully improved results.
loveCraftLetter = image_read("C://Users/sberry5/Documents/teaching/courses/unstructured/data/lovecraftLetter.jpg")
cleanedLetter = loveCraftLetter %>%
image_resize("2000x") %>%
image_convert(type = "Grayscale") %>%
image_trim(fuzz = 50) %>% # This trims background/whitespace from around the edges
image_write(format = "png", density = "600x600", quality = 100)
cleanedRead = ocr(cleanedLetter)
cleanedConfidence = ocr_data(cleanedLetter)
Certainly better on some characters/words.
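A quick way to check that claim, assuming the two ocr_data() results from above are still in memory, is to compare the word-level confidence scores:

```r
# average word confidence, before and after cleaning
mean(letterConfidence$confidence)
mean(cleanedConfidence$confidence)

# share of words tesseract was unsure about (under 60% confidence)
mean(letterConfidence$confidence < 60)
mean(cleanedConfidence$confidence < 60)
```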
Next time, we are going to talk tabulizer…hopefully!